1 Introduction
Probabilistic graphical models (Lauritzen 1996; Koller and Friedman 2009) are widely used in high-dimensional data analysis to synthesize the interaction between variables. In many applications, such as genomics or image analysis, graphical models reduce the number of parameters by selecting the most relevant interactions between variables. Undirected Gaussian Graphical Models (GGMs) are a class of graphical models used in Gaussian settings. In the context of high-dimensional statistics, graphical models are generally assumed sparse, meaning that a small number of variables interact, compared to the total number of possible interactions. This assumption has been shown to provide both statistical and computational advantages by simplifying the structure of dependence between variables (Dempster 1972) and allowing efficient algorithms (Meinshausen and Bühlmann 2006). See, for instance, (Fan, Liao, and Liu 2016) for a review about sparse graphical models inference.
In GGMs, it is well known (see, e.g., Lauritzen (1996)) that inferring the graphical model or equivalently the Conditional Independence Graph (CIG) boils down to inferring the support of the precision matrix \mathbf{\Omega} (the inverse of the variance-covariance matrix). To do so, several \ell_1 penalized methods have been proposed in the literature to learn the CIG of GGMs. For instance, the neighborhood selection (Meinshausen and Bühlmann 2006) (MB) based on a nodewise regression approach via the least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)) is a popular method. Each variable is regressed on the others, taking advantage of the link between the so-obtained regression coefficients and partial correlations. More precisely, for all 1 \le i \le p, the following problem is solved:
\hat{\boldsymbol{\beta}^i}(\lambda) = \underset{\boldsymbol{\beta}^i \in \mathbb{R}^{p-1}}{\operatorname{argmin}} \frac{1}{n} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda \left \lVert \boldsymbol{\beta}^i \right \rVert_1. \tag{1}
In Equation 1, \lambda is a non negative regularization parameter and \mathbf{X}^{\setminus i} denotes the matrix \mathbf{X} deprived of column i. The MB method defined by the estimation problem
Another family of sparse CIG inference methods directly estimates \mathbf{\Omega} via direct minimization of the \ell_1-penalized negative log-likelihood (Banerjee, El Ghaoui, and d’Aspremont 2008), without resorting to the auxiliary regression problem. This method, called the graphical lasso (Friedman, Hastie, and Tibshirani 2007), benefits from many optimization algorithms (Yuan and Lin 2007; Rothman et al. 2008; Banerjee, El Ghaoui, and d’Aspremont 2008; Hsieh et al. 2014).
Such inference methods are widely used and enjoy many favorable theoretical and empirical properties, including robustness to high-dimensional problems. However, some limitations have been observed, particularly in the presence of strongly correlated variables. These limitations are caused by known impairments of Lasso-type regularization in this context (Bühlmann et al. 2012; Park, Hastie, and Tibshirani 2006). To overcome this, in addition to sparsity, several previous works attempt to estimate CIG by integrating clustering structures among variables for the sake of both statistical sanity and interpretability. A non-exhaustive list of works that integrate a clustering structure to speed up or improve the estimation procedure includes (Honorio et al. 2009; Ambroise, Chiquet, and Matias 2009; Mazumder and Hastie 2012; Tan, Witten, and Shojaie 2013; Yao and Allen 2019; Devijver and Gallopin 2018).
The above methods exploit the group structure to simplify the graph inference problem and infer the CIG between single variables. Another question that has received less attention is the inference of the CIG between the groups of variables, i.e., between the meta-variables representative of the group structure. A recent work introducing inference of graphical models on multiple grouping levels is (Cheng, Shan, and Kim 2017). They proposed inferring the CIG of gene data on two levels corresponding to genes and pathways, respectively. Note that pathways are groups of functionally related genes known in advance. The inference is achieved by optimizing a penalized maximum likelihood that estimates a sparse network at both gene and group levels. Our work is also part of this dynamic. We introduce a penalty term allowing parsimonious networks to be built at different hierarchical clustering levels. The main difference with the procedure of (Cheng, Shan, and Kim 2017) is that we do not require prior knowledge of the group structure, which makes the problem significantly more complex. In addition, our method has the advantage of proposing CIGs at more than two levels of granularity.
We introduce the Multiscale Graphical Lasso (MGLasso), a novel method to estimate simultaneously a hierarchical clustering structure, and graphical models depicting the conditional independence structure between clusters of variables at each level of the hierarchy. The procedure is based on a convex optimization problem with a hybrid penalty term combining a graphical Lasso and a group-fused Lasso penalty. In the spirit of convex hierarchical clustering, introduced by (Hocking et al. 2011; Lindsten, Ohlsson, and Ljung 2011), the hierarchy is obtained by spanning the entire regularization path. At each level of the hierarchy, variables in the same clusters are represented by a meta-variable; the new CIG is estimated between these new meta-variables, leading to a multiscale graphical model. Unlike (Yao and Allen 2019), who introduced convex clustering in GGMs, our approach is expected to produce sparse estimations, thanks to an additional \ell_1 penalty.
The remainder of this paper is organized as follows. In section Multiscale Graphical Lasso, we formally introduce the Multiscale Graphical Lasso based on a convex estimation problem and an optimization algorithm based on the continuation of Nesterov’s smoothing technique (Hadj-Selem et al. 2018). Section Simulation experiments presents numerical results on simulated and real data.
2 Multiscale Graphical Lasso
The proposed method aims at inferring a graphical Gaussian model while hierarchically grouping variables. It infers conditional independence between different groups of variables. The approach is based on neighborhood selection (Meinshausen and Bühlmann 2006) and considers an additional fused-Lasso type penalty for clustering. In the spirit of hierarchical convex clustering, the hierarchical structure is recovered by spanning the regularization path.
Let X = (X^1, \dots, X^p)^T \in \mathbb R^p be a p-dimensional Gaussian random vector, with mean vector \mu and covariance matrix \mathbf \Sigma. The conditional independence structure of X is characterized by a graph G = (V, E), where V = \{1,\ldots p\} is the set of variables and E the set of edges, uniquely determined by the support of the precision matrix \mathbf{\Omega} = \mathbf{S}^{-1} (see, e.g., Dempster (1972)). In other words, for any two vertices i,j\in V, the edge (i,j) belongs to the set E if and only if \Omega_{ij} \neq 0, that is if and only if the i-th and j-th variables are conditionally independent given all the others i.e. X^i \perp \!\!\! \perp X^j |X^{\setminus (i, j)} where X^{\setminus (i, j)} is the set of all p variables deprived of variables i and j.
Considering the linear model X^i = \sum_{j\neq i} \beta^i_j X_j + \epsilon_i where \epsilon_i is a Gaussian centered random variable, we have \beta^i_j = -\frac{\Omega_{ij}}{\Omega_{ii}}. We define the regression matrix \boldsymbol{\beta} := [{\boldsymbol{\beta}^1}^T, \ldots, {\boldsymbol{\beta}^p}^T]^T \in \mathbb{R}^{p \times (p-1)}, whose rows are the regression vectors for each of the p regressions.
Let the n \times p-dimensional matrix \mathbf{X} contain n independent observations of X. We propose to minimize the following criterion which combines Lasso and group-fused Lasso penalties:
J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}; \mathbf{X} ) = \frac{1}{2} \sum_{i=1}^p \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{2}
where \tau_{ij} is a permutation exchanging the coefficients \boldsymbol{\beta}^j_j and \boldsymbol{\beta}^j_i and leaves other coefficients untouched, \mathbf{X}^{i}\in \mathbb{R}^n denotes the i-th column of \mathbf{X}, \boldsymbol{\beta}_{i} denotes the i-th row of \beta, \lambda_1 and \lambda_2 are penalization parameters. Let us consider
\hat{\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}}{\operatorname{argmin}} J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}, \mathbf{X}).
The lasso penalty term encourages sparsity and the penalty term \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 encourages to fuse regression vectors \boldsymbol{\beta}^i and \boldsymbol{\beta}^j. These fusions uncover a clustering structure. The model is likely to cluster together variables that have the same conditional effects on the others. Variables X^i and X^j are assigned to the same cluster when \boldsymbol{\beta}^i = \tau_{ij}(\boldsymbol{\beta}^j).
Let us illustrate by an example the effect of the proposed approach. If we consider a group of q variables whose regression vectors have at least q non-zero coefficients and further assume that for each pair of group variables i and j, \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. After some permutations, we get a q\times q block of non-zeros coefficient \beta_{ij} corresponding to the group in the \boldsymbol{\beta} matrix, where (i,j)\in \{1,\cdots, q\}^2. If we consider three different indices i,j,k \ \in \{1,\cdots, q\}^3, it is straightforward to show that the six coefficients indexed by (i,j,k) are all equal. Thus the distance constraints between vectors \boldsymbol{\beta}^i of a group forces equality of all regression coefficients in the group.
The greater the regularization weight \lambda_2, the larger groups become. This is the core principle of the convex relaxation of hierarchical clustering introduced by Hocking et al. (2011). Hence, we can derive a hierarchical clustering structure by spanning the regularization path obtained by varying \lambda_2 while \lambda_1 is fixed. The addition of a fused-type term in graphical models inference has been studied previously by authors such as Honorio et al. (2009), Ganguly and Polonik (2014), Grechkin et al. (2015). However, these existing methods require prior knowledge of the neighborhood of each variable. On the contrary, our approach allows simultaneous inference of a multi-level graphical model and a hierarchical clustering of the variables.
In practice, if some information about the clustering structure is available, the problem can be generalized into:
\min_{\boldsymbol{\beta}} \sum_{i=1}^p\frac{1}{2} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2, \tag{3}
where w_{ij} is a positive weight encoding prior knowledge of the groups to which variables i and j belong to. consider
3 Numerical scheme
This section introduces a complete numerical scheme to apply MGLasso in practice, using a convex optimization algorithm and a model selection procedure. Section Optimization via CONESTA algorithm reviews the principles of the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), the optimization algorithm used in practice to solve MGLasso. Section Reformulation of MGLasso for CONESTA algorithm details a reformulation of MGLasso, which enables us to apply CONESTA. Finally, Section Model selection presents the procedure used to select the regularization parameters.
3.1 Optimization via CONESTA algorithm
The optimization problem for Multiscale Graphical Lasso is convex but not straightforward to solve using classical algorithms because of the fused-lasso type penalty, which is non-separable and admits no closed-form solution for the proximal gradient. We rely on the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), dedicated to high-dimensional regression problems with structured sparsity such as group structures.
The CONESTA solver, initially introduced for neuro-imaging problems, addresses a general class of convex optimization problems which includes group-wise penalties, admitting loss functions of the form:
f(\boldsymbol{\theta} ) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}), \tag{4}
where \lambda_1 and \lambda_2 are penalty weights, and \boldsymbol{\theta}\in \mathbb{R}^d is a d-dimensional vector of parameters to optimize. In the original paper (Hadj-Selem et al. 2018), the function g(\boldsymbol{\theta}) is the sum of a least squares criterion and a ridge penalty, h(\boldsymbol{\theta}) is a penalty whose proximal operator is known in closed-form, and s(\boldsymbol{\theta}) is an \ell_{1,2} penalty of the form
s(\boldsymbol{\theta}) = \sum_{\phi \in \Phi} \|\mathbf{D}_\phi \boldsymbol{\theta}_\phi\|_2. In the definition of s(\boldsymbol{\theta}),
\Phi = \{ \phi_1, \dots, \phi_{\operatorname{Card}(\Phi)}\} is a set of subsets of indices, i.e., \Phi_i\subset \{1,\ldots, d\} for all i\in\{1,\ldots,\operatorname{Card}(\Phi)\} and, for all \phi\in \Phi, \boldsymbol{\theta}_\phi is the sub-vector of \boldsymbol{\theta} defined by \boldsymbol{\theta}_\phi = (\theta_i)_{i\in\phi}. Finally, \mathbf{D}_\phi are linear operators. The main ingredient of CONESTA is the approximation of the non-smooth \ell_{2,1}-norm penalty with unknown proximal gradient, by a smooth function with known proximal gradient computed using Nesterov’s smoothing. Given a smoothing parameter \mu>0, let us define the smooth approximation
s_{\mu}(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha} \in \mathcal{K}} \left \{ \boldsymbol{\alpha}^T \mathbf{D} \boldsymbol{\theta} - \frac{\mu}{2} \| \boldsymbol{\alpha} \|_2^2 \right \}, where \mathcal{K} is the \ell_2 unit ball. Note that
\lim_{\mu \rightarrow 0} s_{\mu}(\boldsymbol{\theta}) = s(\boldsymbol{\theta}). An accelerated proximal gradient algorithm (FISTA, Beck and Teboulle (2009)) step can then be applied after computing the gradient of the smooth part of the approximated criterion which is given by g(\boldsymbol{\theta}) + s_{\mu}(\boldsymbol{\theta}). At each iteration, the smoothing parameter \mu is updated dynamically using the duality gap, and a new approximation is computed. The CONESTA algorithm enjoys a linear convergence rate, and was shown empirically to outperform other computational options for structured-sparsity problems such as ADMM and inexact FISTA in terms of convergence speed (Hadj-Selem et al. 2018).
3.2 Reformulation of MGLasso for CONESTA algorithm
To apply CONESTA, it is necessary to reformulate the MGLasso problem in order to comply with the form of loss function required by CONESTA. The objective of MGLasso can indeed be written as
\operatorname{argmin} \frac{1}{2} ||\mathbf{Y} - \tilde{\mathbf{X}} \tilde{\boldsymbol{\beta}}||_2^2 + \lambda_1 ||\tilde{\boldsymbol{\beta}}||_1 + \lambda_2 \sum_{i<j} ||\boldsymbol A_{ij} \tilde{\boldsymbol{\beta}}||_2, \tag{5}
where \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec(\boldsymbol{\beta})} \in \mathbb{R}^{p (p-1)}, \tilde{\mathbf{X}} is a \mathbb{R}^{[np]\times [p \times (p-1)]} block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol A_{ij} is a p\times p(p-1) matrix defined by
\boldsymbol A_{ij} (k, l) = \begin{cases} 1, \ \text{if} \ l =(i-1)p+k, \\ -1, \ \text{if} \ l = (j-1)p+k,\\ 0, \ \text{otherwise.} \end{cases}
Note that we introduce this notation for simplicity of exposition, but, in practice, the sparsity of the matrices A_{ij} allows a more efficient implementation. Based on reformulation Equation 5, we may apply CONESTA to solve the objective of MGLasso for fixed \lambda_1 and \lambda_2. The procedure is applied, for fixed \lambda_1, to a range of decreasing values of \lambda_2 to obtain a hierarchical clustering. The corresponding pseudo-code is given in the following algorithm where (\mathbf{X}^i)^{+} denotes the pseudo-inverse of \mathbf{X}^i and \epsilon_{fuse} the threshold for merging clusters.
\begin{algorithm}
\caption{MGLasso algorithm}
\begin{algorithmic}
\STATE \textbf{Input}: data $\mathbf{X} \in \mathbb R^{n\times p}$, $\lambda_1$, starting value $\lambda_2$, step $\kappa > 1$ \\ $\operatorname{clusters} \leftarrow \left \{\{1\}, \dots, \{p\}\right \}$ \\ $\boldsymbol{\beta}^i \leftarrow (\mathbf{X}^i)^{+}\mathbf{X}^i$
\WHILE{$\operatorname{Card}(\operatorname{clusters}) \ge 2$}
\STATE $\boldsymbol{\beta} \leftarrow CONESTA(\mathbf{X}, \boldsymbol A, \lambda_1, \lambda_2)$
\FOR{$i = 1$ \TO $p$}
\FOR{$j = i$ \TO $p$}
\IF{$\operatorname{dist}(\boldsymbol{\beta}^i, \boldsymbol{\beta}^j) < \epsilon_{fuse}$}
\STATE $\operatorname{clusters}_i = \operatorname{clusters}_j$
\ENDIF
\ENDFOR
\ENDFOR
\STATE \textbf{Update}: $\operatorname{clusters}$
\STATE $\lambda_2 \leftarrow \lambda_2 \times \kappa$
\ENDWHILE
\STATE \textbf{Return} $\boldsymbol{\beta}$ for each couple $(\lambda_1, \lambda_2$)
\end{algorithmic}
\end{algorithm}
3.3 Model selection
A crucial question for practical applications is the definition of a rule to select the penalty parameters (\lambda_1, \lambda_2). This selection problem operates at two levels: \lambda_1 controls the sparsity of the graphical model, and \lambda_2 controls the number of clusters in the optimal clustering partition. These two parameters are dealt with separately: the sparsity parameter \lambda_1 is chosen via model selection, while the clustering parameter \lambda_2 varies across a grid of values, in order to obtain graphs with different levels of granularity. The problem of model selection in graphical models is difficult in the high dimensional case where the number of samples is small compared to the number of variables, as classical AIC and BIC criteria tend to perform poorly (Liu, Roeder, and Wasserman 2010). Alternative criteria have been proposed in the literature, such as cross-validation (Bien and Tibshirani 2011), GGMSelect (Giraud, Huet, and Verzelen 2012), stability selection (Meinshausen and Bühlmann 2010; Liu, Roeder, and Wasserman 2010), Extended Bayesian Information Criterion (EBIC) (Foygel and Drton 2010), and Rotation Information Criterion (Zhao et al. 2012).
In this paper, we focused on the StARS stability selection approach proposed by Liu, Roeder, and Wasserman (2010). The method uses k subsamples of data to estimate the associated graphs for a given range of \lambda_1 values. For each value, a global instability of the graph edges is computed. The optimal value of \lambda_1 is chosen so as to minimize the instability, as follows. Let \lambda^{(1)}_1, \dots, \lambda_1^{(K)} be a grid of sparsity regularization parameters, and S_1, \dots, S_N be N bootstrap samples obtained by sampling the rows of the data set \mathbf{X}. For each k\in\{1,\ldots,K\} and for each j\in\{1,\ldots, N\}, we denote by \mathcal{A}^{k,j}(\mathbf{X}) the adjacency matrix of the estimated graph obtained by applying the inference algorithm to S_n with regularization parameter \lambda_1^{(k)}. For each possible edge (s,t)\in\{1,\ldots,p\}^2, the probability of edge appearance is estimated empirically by
\hat \theta_{st}^{(k)} = \frac{1}{N} \sum_{j=1}^N \mathcal{A}^{k,j}_{st}.
Define
\hat \xi_{st}(\Lambda) = 2 \hat \theta_{st} (\Lambda) \left ( 1 - \hat \theta_{st} (\Lambda) \right )
the empirical instability of edge (s,t) (that is, twice the variance of the Bernoulli indicator of edge (s,t)). The instability level associated to \lambda_1^{(k)} is given by
\hat D(\lambda_1^{(k)}) = \frac{\sum_{s<t} \hat \xi_{st}(\lambda_1^{(k)})}{p \choose 2}, StARS selects the optimal penalty parameter as follows
\hat \lambda = \max_k\left\{ \lambda_1^{(k)}: \hat D(\lambda_1^{(k)}) \le \beta, k\in\{1,\ldots,K\} \right \},
where \beta is the threshold chosen for the instability level.
4 Simulation experiments
In this section, we conduct a simulation study to evaluate the performance of the MGLasso method, both in terms of clustering and support recovery. Receiver Operating Characteristic (ROC) curves are used to evaluate the adequacy of the inferred graphs with the reality for the MGLasso and GLasso methods in the Erdös-Renyi, Scale-free, and Stochastic Block Models frameworks. The Adjusted Rand indices are used to compare the partitions obtained with MGLasso, hierarchical agglomerative clustering, and K-means clustering in a stochastic block model framework.
4.1 Synthetic data models
We consider three different synthetic network models: the Stochastic Block Model (SBM, Fienberg and Wasserman (1981)), the Erdös-Renyi model (Erdős, Rényi, et al. 1960) and the Scale-Free model (Newman, Strogatz, and Watts 2001). In each case, Gaussian data is generated by drawing n independent realizations of a multivariate Gaussian distribution \mathcal N(0, \mathbf{\Sigma}) where \mathbf{\Sigma} \in \mathbb{R}^{p \times p} and \mathbf{\Omega} = \mathbf{\Sigma} ^{-1}. The support of \mathbf{\Omega}, equivalent to the network adjacency matrix, is generated from the three different models. The difficulty level of the problem is controlled by varying the ratio \frac{n}{p} with p fixed at 40: \frac{n}{p}\in \{0.5,1,2\}.
4.1.1 Stochastic Block-Model
We construct a block-diagonal precision matrix \mathbf{\Omega} as follows. First, we generate the support of \mathbf{\Omega} as shown in Figure 1, denoted by \boldsymbol A\in\{0,1\}^{p\times p}. To do this, the variables are first partitioned into K = 5 hidden groups, noted C_1, \dots, C_K described by a latent random variable Z_i, such that Z_i = k if i = C_k. Z_i follows a multinomial distribution
P(Z_i = k) = \pi_k, \quad \forall k \in \{1, \dots, K\}, where \pi = (\pi_1, \dots, \pi_k) is the vector of proportions of clusters whose sum is equal to one. The set of latent variables is noted \mathbf{Z} = \{ Z_1, \dots, Z_K\}. Conditionally to \mathbf{Z}, A_{ij} follows a Bernoulli distribution such that
A_{ij}|Z_i = k, Z_j = l \sim \mathcal{B}(\alpha_{kl}), \quad \forall k,l \in \{1 \dots, K\},
where \alpha_{kl} is the probability of inter-cluster connectivity, with \alpha_{kl} = 0.01 if k\neq l and \alpha_{ll} = 0,75. For k\in\{1,\ldots, K\}, we define p_k = \sum_{i=1}^p \boldsymbol{1}_{\{Z_i = k\}}. The precision matrix \mathbf{\Omega} of the graph is then calculated as follows. We define \Omega_{ij} = 0 if Z_i\neq Z_j ; otherwise, we define \Omega_{ij} = A_{ij}\omega_{ij} where, for all i\in\{1,\ldots,p\} and for all j\in\{1,\ldots,p| Z_j = Z_i\}, \omega_{ij} is given by :
If \alpha_{ll} were to be equal to one, this construction of \mathbf{\Omega} would make it possible to control the level of correlation between the variables in each block to \rho. Introducing a more realistic scheme with \alpha_{ll}=0.75 allows only to have an approximate control.
Show the code
path_checks <- "./rscripts/mglasso_functions/"
source(paste0(path_checks, "simulate.R"))
source(paste0(path_checks, "plot.R"))
bloc_diag <- function(n_vars, connectivity_mat, prop_clusters, rho) {
true_nclusters <- 0
while (true_nclusters != length(prop_clusters)){ ## controle pour s'assurer qu'il y a bien le nombre de clusters voulu
network <- simone::rNetwork(n_vars, connectivity_mat, prop_clusters)
true_nclusters <- length(unique(network$clusters))
}
precision_mat <- network$A[order(network$clusters), order(network$clusters)]
eff_clusters <- table(network$clusters)
b = -rho/(1+rho*(eff_clusters-2) -rho^2*(eff_clusters-1))
d = (1+rho*(eff_clusters-2))/(1+rho*(eff_clusters-2) -rho^2*(eff_clusters-1))
for (i in 1:length(eff_clusters)) {
#bloc <- matrix(b[i], eff_clusters[i], eff_clusters[i])
temp <- precision_mat[which(row.names(precision_mat) == i), which(row.names(precision_mat) == i)]
temp <- as.matrix(temp)
temp[temp != 0] = b[i]
diag(temp) = d[i]
precision_mat[which(row.names(precision_mat) == i), which(row.names(precision_mat) == i)] = temp
}
flag <- min(svd(precision_mat)$d)
if(flag < 0){
diag(precision_mat) <- 1 - flag
}
return(precision_mat)
}
#' simulate data with given graph structure
#'
#' @param p
#' @param np_ratio
#' @param structure
#' @param alpha
#' @param prob_mat
#' @param rho
#' @param g
#' @param inter_cluster_edge_prob
#' @param p_erdos
#' @param verbose
#'
#' @return A list:
#' graph : precision
sim_data <- function(p = 20,
np_ratio = 2,
structure = c("block_diagonal", "hub", "scale_free", "erdos"),
alpha,
prob_mat,
rho,
g,
inter_cluster_edge_prob = 0.01,
p_erdos = 0.1,
verbose = FALSE){
structure <- match.arg(structure)
n = round(np_ratio * p)
switch (structure,
erdos = {
network <- simone::rNetwork(p, p_erdos, 1)
correlation <- solve(network$Theta)
X <- mvtnorm::rmvnorm(n, sigma = correlation)
graph <- network$Theta
if(verbose) message("Data, precision matrix and graph generated from Erdos structure.")
},
hub = {
L = huge::huge.generator(graph = "hub", g = 5, d = p, n = n, verbose = FALSE)
X=L$data
graph = L$omega
correlation = L$sigma
if(verbose) message("Data, precision matrix and graph generated from Hub structure.")
},
scale_free = {
L = huge::huge.generator(graph = "scale-free", d = p, n = n, verbose = FALSE) ## d edges graph
X=L$data
graph = L$omega
correlation = L$sigma
if(verbose) message("Data, precision matrix and graph generated from Scale free structure.")
},
block_diagonal = {
if(inter_cluster_edge_prob != 0) { # inter-clusters edges added in the flipped way
flag <- TRUE
while(flag) { # To do: treat the flag case instead of generating a new model
K <- bloc_diag(p, prob_mat, alpha, rho)
target_indices <- which(K[upper.tri(K)] == 0) # Random selection of edges to be set to 1
select_len <- round(length(target_indices) * inter_cluster_edge_prob)
selected_indices <- sample(target_indices, select_len)
precision_level <- unique(K[upper.tri(K)])
precision_level <- max(precision_level[precision_level != 0]) # je prends la precision max à l'intérieur des blocs
# et je l'affecte dans les arêtes inter-blocs
K[upper.tri(K)][selected_indices] <- precision_level
K <- as.matrix(Matrix::forceSymmetric(K, uplo = "U"))
flag <- any(eigen(K)$values <= 0) # Control of positive definiteness
}
correlation <- solve(K)
graph = K
X <- mvtnorm::rmvnorm(n, sigma = correlation)
if(verbose) message("Data, precision matrix and graph generated from block-diagonal
structure with inter-clusters edges.")
}
else { # Only intra-cluster edges while approximately controlling correlation level
K <- bloc_diag(p, prob_mat, alpha, rho)
correlation <- solve(K)
graph = K
X <- mvtnorm::rmvnorm(n, sigma = correlation)
if(verbose) message("Data, precision matrix and graph generated from block-diagonal
structure with only intra-clusters edges.")
}
}
)
return(list(X=X, graph=graph, correlation=correlation))
}
adj_mat <- function(mat) {
diag(mat) <- 0
mat[ abs(mat) < 1e-10] <- 0
mat[mat != 0] <- 1
return(mat)
}
set.seed(2020)
sim_sbm <- sim_data(p = 40, structure = "block_diagonal",
alpha = rep(1/5, 5),
prob_mat = diag(0.75, 5),
rho = 0.2, inter_cluster_edge_prob = 0.01)
gsbm <- adj_mat(sim_sbm$graph)
Matrix::image(as(gsbm, "sparseMatrix"),
sub = "", xlab = "", ylab = "") 4.1.2 Erdös-Renyi Model
The Erdös-Renyi model is a special case of the stochastic block model where \alpha_{kl} = \alpha_{ll} = \alpha is constant. We set the density \alpha of the graph to 0.1; see Figure 2 for an example of the graph resulting from this model.
Show the code
set.seed(2022)
sim_erdos <- sim_data(p = 40, structure = "erdos", p_erdos = 0.1)
gerdos <- adj_mat(sim_erdos$graph)
Matrix::image(as(gerdos, "sparseMatrix"),
sub = "", xlab = "", ylab = "")4.1.3 Scale-free Model
The Scale-free Model generates networks whose degree distributions follow a power law. The graph starts with an initial chain graph of 2 nodes. Then, new nodes are added to the graph one by one. Each new node is connected to an existing node with a probability proportional to the degree of the existing node. We set the number of edges in the graph to 40. An example of scale-free graph is shown in Figure 3.
Show the code
set.seed(2022)
sim_sfree <- sim_data(p = 40, structure = "scale_free")
gsfree <- adj_mat(sim_sfree$graph)
Matrix::image(as(gsfree, "sparseMatrix"),
sub = "", xlab = "", ylab = "")4.2 Support recovery
We compare the network structure learning performance of our approach to that of GLasso in its neighborhood selection version using ROC curves. In both GLasso and MGLasso, the sparsity is controlled by a regularization parameter \lambda_1; however, MGLasso admits an additional regularization parameter, \lambda_2, which controls the strength of convex clustering. To compare the two methods, in each ROC curve, we vary the parameter \lambda_1 while the parameter \lambda_2 (for MGLasso) is kept constant. We computed ROC curves for 4 different penalty levels for the \lambda_2 parameter; since GLasso does not depend on \lambda_2, the GLasso ROC curves are replicated.
In a decision rule associated with a sparsity penalty level \lambda_1, we recall the definition of the two following functions. The sensitivity, also called the true positive rate or recall, is given by : \begin{align*} \lambda_1 &\mapsto \text{sensitivity}(\lambda_1) = \frac{TP(\lambda_1)}{TP(\lambda_1) + FN(\lambda_1)}. \end{align*} Specificity, also called true negative rate or selectivity, is defined as follows: \begin{align*} \lambda_1 &\mapsto \text{specificity}(\lambda_1) = \frac{TN(\lambda_1)}{TN(\lambda_1) + FP(\lambda_1)}. \end{align*} The ROC curve with the parameter \lambda_1 represents \text{sensitivity}(\lambda_1) as a function of 1 - \text{specificity}(\lambda_1) which is the false positive rate.
For each configuration (n, p fixed), we generate 50 replications and their associated ROC curves, which are then averaged. The average ROC curves for the three models are given in Figure 4, Figure 5 and Figure 6 by varying \frac{n}{p}\in \{0.5,1,2\}.
Show the code
load("./data/roc_dtf_erdos.RData")
ggplot(roc_dtf_erdos, aes(x = fpr,
y = tpr,
color = method )) +
geom_line() +
facet_grid(np ~ tv) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("") +
scale_colour_grey(start = 0.2, end = 0.6) +
theme_bw()Show the code
load("./data/roc_dtf_sfree.RData")
ggplot(roc_dtf_sfree, aes(x = fpr,
y = tpr,
color = method )) +
geom_line() +
facet_grid(np ~ tv) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("") +
scale_colour_grey(start = 0.2, end = 0.6) +
theme_bw()Show the code
load("./data/roc_dtf_sbm.RData")
ggplot(roc_dtf_sbm, aes(x = fpr,
y = tpr,
color = method )) +
geom_line() +
facet_grid(np ~ tv) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("") +
scale_colour_grey(start = 0.2, end = 0.6) +
theme_bw()Based on these empirical results, we first observe that, in all the considered simulation models, MGLasso improves over GLasso in terms of support recovery in the high-dimensional setting where p<n. In addition, in the absence of a total variation penalty, i.e., \lambda_2 = 0, MGLasso performs no worse than GLasso in each of the 3 models. However, for \lambda_2>0, increasing penalty value does not seem to significantly improve the support recovery performances for the MGLasso, as we observe similar results for \lambda_2=3.3,6.6,10. Preliminary analyses show that, as \lambda_2 increases, the estimates of the regression vectors are shrunk towards 0. This shrinkage effect of group-fused penalty terms was also observed in (Chu et al. 2021).
4.3 Clustering
In order to obtain clustering performance, we compared the partitions estimated by MGLasso, Hierarchical Agglomerative Clustering (HAC) with Ward’s distance and K-means to the true partition in a stochastic block model framework. Euclidean distances between variables are used for HAC and K-means. The criterion used for the comparison is the adjusted Rand index. We studied the influence of the correlation level inside clusters on the clustering performances through two different parameters: \rho \in \{ 0.1, 0.3 \}; the vector of cluster proportions is fixed at \mathbf \pi = (1/5, \dots, 1/5). We then simulate 100 Gaussian data sets for each simulation configuration (\rho, n/p fixed).The optimal sparsity penalty for MGLasso is chosen by the Stability Approach to Regularization Selection (StARS) method (Liu, Roeder, and Wasserman 2010), and we vary the parameter \lambda_2.
Show the code
load("./data/rand_dt_lower_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.25, np_ = c(0.5, 1, 2),
main = "")Show the code
load("./data/rand_dt_higher_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20), cor_ = 0.95, np_ = c(0.5, 1, 2),
main = "")The results shown in Figure 7 and Figure 8 demonstrate that, particularly at the lower to medium levels of the hierarchy (between 20 and 10 clusters), the hierarchical clustering structure uncovered by MGLasso is comparable to popular clustering methods used in practice. For the higher levels (5 clusters), the performances of MGLasso deteriorate. As expected, the three compared methods also deteriorate as the level of correlation inside clusters decreases.
5 Analysis of microbial associations data
We finally illustrate our new method of inferring the multiscale Gaussian graphical model, with an application to the analysis of microbial associations in the American Gut Project. The data used are count data that have been previously normalized by applying the log-centered ratio technique as used in (Kurtz 2015). After some filtering steps (Kurtz 2015) on the operational taxonomic units OTUs counts (removed if present in less than 37\% of the samples) and the samples (removed if sequencing depth below 2700), the top OTUs are grouped in a dataset composed of n_1 = 289 for 127 OTUs. As a preliminary analysis, we perform a hierarchical agglomerative clustering (HAC) on the OTUs, which allows us to identify four significant groups. The correlation matrix of the dataset is given in Figure 9; variables have been rearranged according to the HAC partition.
Show the code
data(amgut1.filt, package = "SpiecEasi")
dta <- amgut1.filt
dta <- t(SpiecEasi::clr(dta + 1 , 1))
S <- cor(dta)
hclust_dta <- hclust(dist(t(dta)), method = "ward.D")
hclust_dta <- hclust(as.dist(1-S^2), method = "ward.D")
cut_dta <- stats::cutree(hclust_dta, 4)
Matrix::image(as(S[order(cut_dta), order(cut_dta)],
"sparseMatrix"),
sub = "", xlab = "", ylab = "",
colorkey = FALSE)The average correlations within blocks of variables belonging to the same cluster are given below. We observe relatively high levels of correlation in small blocks, similar to the simulation models used to evaluate the performance of clustering in the Simulation Experiments section.
Show the code
C <- cor(dta)
diag(C) <- 0
clusters <- cut_dta
seq_p <- 1:length(clusters)
L <- split(seq_p, factor(clusters))
mat <- t(sapply(L,
FUN = function(v) {
summary(as.vector(C[v,v]))
}))
out <- data.frame(Cluster = 1:4, "Mean correlation" = round(mat[, "Mean"], 3))
knitr::kable(out)| Cluster | Mean.correlation |
|---|---|
| 1 | 0.013 |
| 2 | 0.815 |
| 3 | 0.555 |
| 4 | 0.566 |
We apply MGLasso to the normalised counts to infer a graph and a clustering structure. The graphs obtained by MGLasso for \lambda_2 = 0 and for \lambda_2 = 5 (corresponding respectively 127 and 80 clusters) are given below. In each case, the parameter \lambda_1 is chosen by stability selection (see Model Selection section).
Show the code
knitr::include_graphics(c("figures/mglasso_tv0.png","figures/mglasso_full_graph_tv5.png"))The variables were reordered according to the clustering partition obtained from the distances between the regression vectors. Increasing \lambda_2 reduces the number of clusters and leads to a shrinking effect on the estimates. The adjacency matrix of the neighborhood selection equivalent to setting \lambda_2 to 0 is given in Figure 10 (left). In Figure 10 (right), the deduced partition is composed of 80 clusters. A confusion matrix comparing the edges deduced by MGLasso with \lambda_2 = 5 and neighborhood selection is given below. Adding a total variation parameter increases the merging effect, resulting in a larger number of edges in the graph.
| Neighborhood selection | |||
|---|---|---|---|
| MGLasso (\lambda_2 = 5) | non-edges | edges | |
| non-edges | 15678 | 0 | |
| edges | 288 | 163 |
6 Conclusion
We proposed a new technique that combines Gaussian Graphical Model inference and hierarchical clustering called MGLasso. The method proceeds via convex optimization and minimizes the neighborhood selection objective penalized by a hybrid regularization combining a sparsity-inducing norm and a convex clustering penalty. We developed a complete numerical scheme to apply MGLasso in practice, with an optimization algorithm based on CONESTA and a model selection procedure. Our simulations results over synthetic and real datasets showed that MGLasso can perform better than GLasso in network support recovery in the presence of groups of correlated variables, and we illustrated the method with the analysis of microbial associations data. The present work paves the way for future improvements: first, by incorporating prior knowledge through more flexible weighted regularization; second, by studying the theoretical properties of the method in terms of statistical guarantees for the MGLasso estimator.
Session information
Show the code
sessionInfo()R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8
[3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
[5] LC_TIME=French_France.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.4-1 huge_1.3.5 SpiecEasi_1.1.2 simone_1.0-4
[5] blockmodels_1.1.5 ggplot2_3.3.6
loaded via a namespace (and not attached):
[1] shape_1.4.6 tidyselect_1.1.2 xfun_0.31 purrr_0.3.4
[5] splines_4.2.1 lattice_0.20-45 colorspace_2.0-3 vctrs_0.4.1
[9] generics_0.1.3 htmltools_0.5.3 stats4_4.2.1 yaml_2.3.5
[13] utf8_1.2.2 survival_3.3-1 rlang_1.0.4 pillar_1.8.1
[17] glue_1.6.2 withr_2.5.0 DBI_1.1.3 foreach_1.5.2
[21] lifecycle_1.0.1 stringr_1.4.1 munsell_0.5.0 gtable_0.3.0
[25] mvtnorm_1.1-3 htmlwidgets_1.5.4 codetools_0.2-18 VGAM_1.1-7
[29] evaluate_0.16 labeling_0.4.2 knitr_1.40 fastmap_1.1.0
[33] parallel_4.2.1 fansi_1.0.3 highr_0.9 Rcpp_1.0.9
[37] scales_1.2.0 captioner_2.2.3 jsonlite_1.8.0 farver_2.1.1
[41] pulsar_0.3.7 digest_0.6.29 stringi_1.7.8 dplyr_1.0.9
[45] grid_4.2.1 cli_3.3.0 tools_4.2.1 magrittr_2.0.3
[49] glmnet_4.1-4 tibble_3.1.8 pkgconfig_2.0.3 MASS_7.3-57
[53] assertthat_0.2.1 rmarkdown_2.16.1 rstudioapi_0.13 iterators_1.0.14
[57] R6_2.5.1 igraph_1.3.4 compiler_4.2.1
References
Reuse
Citation
@article{sanou,
author = {Edmond Sanou and Christophe Ambroise and Geneviève Robin},
title = {Inference of {Multiscale} {Gaussian} {Graphical} {Model}},
journal = {Computo},
date = {},
url = {https://github.com/desanou/multiscale_glasso},
doi = {xxxx},
langid = {en},
abstract = {Gaussian Graphical Models (GGMs) are widely used for
exploratory data analysis in various fields such as genomics,
ecology, psychometry. In a high-dimensional setting, when the number
of variables exceeds the number of observations by several orders of
magnitude, the estimation of GGM is a difficult and unstable
optimization problem. Clustering of variables or variable selection
is often performed prior to GGM estimation. We propose a new method
allowing to simultaneously infer a hierarchical clustering structure
and the graphs describing the structure of independence at each
level of the hierarchy. This method is based on solving a convex
optimization problem combining a graphical lasso penalty with a
fused type lasso penalty. Results on real and synthetic data are
presented.}
}